refactor(supervise): substrate-agnostic TraceSource — sandbox-first trace analysis by drewstone · Pull Request #320 · tangle-network/agent-runtime

drewstone · 2026-06-17T10:08:25Z

Corrects the trace-analysis layer, which I originally built router-only. Production runs on sandbox/fleet, so the detectors must run there. The SDK exposes tool calls via the session (SessionMessage.parts / streamPrompt), not via exportTrace, which is sandbox telemetry (my earlier red herring).

The fix

One interface over agent-eval's ToolSpan (the common currency), two source implementations:

createPushTraceSource — owned loops (router-tools, cli-bridge tool dispatch): the loop records each tool call.
sandboxSessionTraceSource(box, sessionId) — the box: box.messages({sessionId}) → session parts → decodeToolPart (defensive across OpenAI function + harness tool/tool_use shapes) → spans.
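A minimal sketch of what "defensive across OpenAI function + harness tool/tool_use shapes" could look like. The part shapes, the `ToolSpan` fields, and the `decodeToolPart` signature below are assumptions for illustration; the real SDK schema and agent-eval's `ToolSpan` are not shown in this PR.

```typescript
// Hypothetical span shape; agent-eval's real ToolSpan may carry more fields.
interface ToolSpan {
  tool: string;
  args: unknown;
  status: "ok" | "error";
}

// Assumed part shapes (illustrative, not confirmed against the SDK):
//   OpenAI function call: { type: "function", function: { name, arguments: "<json>" } }
//   harness tool part:    { type: "tool" | "tool_use", name | tool, input | args, isError? }
function asString(v: unknown): string | null {
  return typeof v === "string" ? v : null;
}

function parseArgs(raw: unknown): unknown {
  if (typeof raw !== "string") return raw ?? {};
  try {
    return JSON.parse(raw);
  } catch {
    // Keep unparseable arguments rather than dropping the span.
    return { raw };
  }
}

// Returns null for anything that is not recognisably a tool call.
function decodeToolPart(part: unknown): ToolSpan | null {
  if (typeof part !== "object" || part === null) return null;
  const p = part as Record<string, unknown>;

  if (p["type"] === "function") {
    const fn = p["function"];
    if (typeof fn !== "object" || fn === null) return null;
    const f = fn as Record<string, unknown>;
    const name = asString(f["name"]);
    if (name === null) return null;
    return { tool: name, args: parseArgs(f["arguments"]), status: "ok" };
  }

  if (p["type"] === "tool" || p["type"] === "tool_use") {
    const name = asString(p["name"]) ?? asString(p["tool"]);
    if (name === null) return null;
    const args = p["input"] ?? p["args"] ?? {};
    return { tool: name, args, status: p["isError"] === true ? "error" : "ok" };
  }

  return null;
}
```

Returning `null` instead of throwing lets the source skip text parts and unknown shapes without aborting the whole session decode.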

Two consumers ride the source: watchTrace (online → finding on the bus) and analyzeTrace (settle → agent-eval batch analyzers buildTrajectory/stuckLoopView/toolWasteView).
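The source/consumer split described above can be sketched as follows. This is an illustration under stated assumptions: the `TraceSource` methods, the consumer signatures, and the inline repeat check are hypothetical stand-ins. The real consumers delegate to agent-eval's published detectors and analyzers rather than reimplementing them.

```typescript
// Hypothetical span shape (see agent-eval for the real ToolSpan).
interface ToolSpan {
  tool: string;
  args: unknown;
  status: "ok" | "error";
}

// Assumed interface: live subscription for online use, snapshot for settle.
interface TraceSource {
  subscribe(onSpan: (span: ToolSpan) => void): () => void;
  spans(): readonly ToolSpan[];
}

interface PushTraceSource extends TraceSource {
  record(span: ToolSpan): void;
}

// Owned loops call record() once per tool call.
function createPushTraceSource(): PushTraceSource {
  const recorded: ToolSpan[] = [];
  const listeners = new Set<(span: ToolSpan) => void>();
  return {
    record(span) {
      recorded.push(span);
      for (const listener of listeners) listener(span);
    },
    subscribe(onSpan) {
      listeners.add(onSpan);
      return () => {
        listeners.delete(onSpan);
      };
    },
    spans: () => recorded,
  };
}

// Online consumer (stand-in detector): fires once when the same tool+args
// repeats `threshold` times in a row.
function watchTrace(
  source: TraceSource,
  threshold: number,
  onFinding: (finding: string) => void,
): () => void {
  let lastKey = "";
  let streak = 0;
  return source.subscribe((span) => {
    const key = `${span.tool}:${JSON.stringify(span.args)}`;
    streak = key === lastKey ? streak + 1 : 1;
    lastKey = key;
    if (streak === threshold) onFinding(`repeated ${span.tool} x${streak}`);
  });
}

// Settle consumer (stand-in analyzer): per-tool call counts over the full run.
function analyzeTrace(source: TraceSource): Map<string, number> {
  const counts = new Map<string, number>();
  for (const span of source.spans()) {
    counts.set(span.tool, (counts.get(span.tool) ?? 0) + 1);
  }
  return counts;
}
```

Because both consumers only see `TraceSource`, a sandbox-backed source can be swapped in without touching either of them.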

Deleted the premature router-only createDetectorMonitor/ToolStep/createTrajectoryRecorder/RecordedToolStep.

Why

This is the §1.5 'author the interface, materialize per substrate' rule I violated by building a router-only onToolStep seam. ToolSpan is the shared currency; a new substrate implements one interface; the same published agent-eval detectors + analyzers run everywhere. Local testing via cli-bridge/router; staging/prod via sandboxes.

Verification

decoder across OpenAI + harness shapes; the sandbox box path end-to-end via a mock box (session parts → loop detected by the batch analyzer); owned-loop push path; online + settle consumers.
full suite 1023 pass; typecheck/build/lint clean; merges cleanly into main.
Honest gap: the exact live harness part-shape still has to be validated on a real box / cli-bridge run. The decoder is defensive, but the precise parts schema must be confirmed against the running SDK, not the .d.ts.

🤖 Generated with Claude Code

… resume_worker Close the bus to 100% bidirectional. The parent→child down-leg routes to the child inbox (scope.send→deliver) AND records a queue:false event on the same bus: it lands in history() + reaches subscribers for the audit trail, but is never pulled back by the parent. New: resume_worker (continue a parked worker — the protocol had {resume} but no verb); answer_question now routes the answer DOWN to the asking worker, unparking it. EventBus gains PublishOptions.queue for record-only events. down-leg + bidirectional history tests; full suite 1000 pass; typecheck/build/lint clean.
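A rough sketch of the record-only publish described in this commit: a `queue:false` event lands in `history()` and reaches subscribers but never enters the pull queue. The class layout and the `pull` method name are assumptions; only `PublishOptions.queue` and `history()` are named in the commit.

```typescript
interface BusEvent {
  kind: string;
  payload?: unknown;
}

interface PublishOptions {
  // false = record-only: audit trail + subscribers, never pulled.
  queue?: boolean;
}

class EventBus {
  private readonly log: BusEvent[] = [];
  private readonly pending: BusEvent[] = [];
  private readonly subscribers = new Set<(event: BusEvent) => void>();

  publish(event: BusEvent, opts: PublishOptions = {}): void {
    this.log.push(event);
    for (const subscriber of this.subscribers) subscriber(event);
    // Queue by default; down-leg events opt out with queue:false.
    if (opts.queue !== false) this.pending.push(event);
  }

  subscribe(fn: (event: BusEvent) => void): () => void {
    this.subscribers.add(fn);
    return () => {
      this.subscribers.delete(fn);
    };
  }

  history(): readonly BusEvent[] {
    return [...this.log];
  }

  // Hypothetical name for the parent's pull side.
  pull(): BusEvent | undefined {
    return this.pending.shift();
  }
}
```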

…iew gaps
Address PR #318 review:
- BLOCKING: answer_question computed `delivered` but returned only { question }; it now returns { question, delivered }, consistent with steer_worker/resume_worker (no longer hides whether the answer reached a live worker).
- tests: answer routed down to a LIVE worker (delivered:true happy path); resume_worker delivered:false path; a focused event-bus queue:false unit test (history+subscribers see it, pull queue never does).
- resume_worker added to OPERATOR_TOOLS + the driver system prompt so the driver is actually prompted to use it.

Make the down-leg actually move a live worker (was observable-only). New createInbox (supervise/inbox.ts) is the receive end an executor exposes as Executor.deliver; the owned tool-loop (routerToolsInlineExecutor) drains it two ways:
- QUEUED (default): flush at each step boundary AND before the worker may settle; it can't finish while a steer/answer it never read is pending.
- FORCEFUL (steer_worker interrupt:true): aborts the in-flight turn so the worker re-plans immediately, breaking it off a wrong path mid-task.
Black-box CLI harnesses can't be interrupted mid-step → down-leg degrades to next spawn.
inbox 4 + executor-drains-inbox integration test (flush-before-settle proven end to end through the real executor); full suite 1008 pass; typecheck/build/lint clean.
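The two drain modes can be sketched like this. Everything beyond the names `createInbox` and `deliver` is an assumption (the `drain`/`turnSignal` methods, the message shape, and the synchronous `runWorker` loop are hypothetical); the real executor is asynchronous and lives in routerToolsInlineExecutor.

```typescript
interface DownMessage {
  kind: "steer" | "answer";
  text: string;
  interrupt?: boolean;
}

interface Inbox {
  deliver(msg: DownMessage): void;
  drain(): DownMessage[];
  // Signal for the current turn; aborted by a forceful delivery.
  turnSignal(): AbortSignal;
}

function createInbox(): Inbox {
  let queue: DownMessage[] = [];
  let controller = new AbortController();
  return {
    deliver(msg) {
      queue.push(msg);
      // FORCEFUL: abort whatever turn is in flight.
      if (msg.interrupt) controller.abort();
    },
    drain() {
      const out = queue;
      queue = [];
      return out;
    },
    turnSignal() {
      // A used-up (aborted) controller is replaced for the next turn.
      if (controller.signal.aborted) controller = new AbortController();
      return controller.signal;
    },
  };
}

// QUEUED: flush at every step boundary and once more before settling, so the
// worker cannot finish with an unread steer/answer pending.
function runWorker(inbox: Inbox, steps: Array<() => string>): string[] {
  const transcript: string[] = [];
  const flush = () => {
    for (const msg of inbox.drain()) transcript.push(`read:${msg.text}`);
  };
  for (const step of steps) {
    flush();
    transcript.push(`step:${step()}`);
  }
  flush(); // flush-before-settle
  transcript.push("settled");
  return transcript;
}
```

A message delivered during the last step still shows up in the transcript before `settled`, which is the flush-before-settle property the integration test checks.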

…sendDown covers answer
PR #318 audit follow-ups (non-blocking):
- resume_worker description no longer implies a park/resume lifecycle the scope model lacks: a settled (drained) worker is gone; it says so and points to spawning fresh.
- sendDown now covers the 'answer' down-leg too (removes the inline bus.publish duplication; one helper for all three down kinds).
- history() docstring lists the down-leg event kinds.
full suite 1008 pass; typecheck/lint clean.

Simplify without losing capability:
- MERGE steer_worker + resume_worker → one steer_worker (any live worker; the only real axis was interrupt forceful-vs-queued, already a param). 'Resume' = a non-interrupt steer. Removes a redundant verb + dissolves the resume-vs-steer prompt nits.
- REMOVE await_next: it was a strict subset of await_event({kinds:['settled']}). One wait-verb now; callers/prompts pass kinds:['settled'] for the next finished worker.
- DROP bus.peek(): speculative, only its own test used it (YAGNI).
Down-leg event union + inbox shed the dead 'resume' kind. Full suite 1007 pass; typecheck/build/lint clean.

…gent-eval kernel) createDetectorMonitor (supervise/detector-monitor.ts) — the online analyst on the live worker pipe. Folds each tool step through agent-eval 0.93.0's published streaming kernel (repeatedActionDetector/errorStreakDetector — the SAME kernel control-runtime folds; no detection logic reimplemented) and fires onSignal → a finding on the bus the moment a worker loops or error-storms. routerToolsInlineExecutor feeds it via a new onToolStep seam. Bumps agent-eval ^0.93.0. monitor tests (4); full suite 1011 pass; typecheck/build/lint clean.

Last mile: createCoordinationTools.raiseFinding (exposed on the MCP handle) is the seam an ONLINE detector uses to publish a finding on the live bus mid-run. Proven end-to-end: a stuck-loop on the worker pipe → monitor → raiseFinding → await_event surfaces it.
Review fixes (audit on the earlier commit):
- HIGH: AbortSignal.any (needs Node 20.3, floor is 20) → portable mergeAbortSignals.
- forceful interrupt: docstring no longer overpromises (aborts in-flight inference, a tool mid-exec finishes first); interrupted turns no longer count toward maxTurns; added the e2e test (forceful steer aborts the turn, re-plans, aborted turn is free).
- answer to a BLOCKING question is now delivered forcefully (interrupt) to unpark the worker immediately, not at its next boundary.
- sendDown 'answer' now REQUIRES questionId (overload; no silent ?? '' mask).
- tool-step status captured (error vs ok) for the error-streak detector.
- stale await_next purged from bench prompts + docs; history() docstring drops 'resume'.
- added tests: answer delivered:false + return asserted; await_event idle-on-mismatch.
full suite 1014 pass; typecheck/build/lint clean.
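For the AbortSignal.any fix, a portable merge can be sketched with plain AbortController and event listeners, which work on the Node 20.0 floor. The signature and the `dispose` handle are assumptions; the PR names `mergeAbortSignals` but does not show its shape.

```typescript
// Combine several signals into one without AbortSignal.any (Node >= 20.3).
// dispose() detaches the listeners so long-lived inputs do not accumulate them.
function mergeAbortSignals(signals: readonly AbortSignal[]): {
  signal: AbortSignal;
  dispose: () => void;
} {
  const controller = new AbortController();
  const cleanups: Array<() => void> = [];
  const dispose = (): void => {
    for (const cleanup of cleanups.splice(0)) cleanup();
  };

  for (const source of signals) {
    if (source.aborted) {
      // Already aborted: propagate immediately and stop linking.
      controller.abort(source.reason);
      dispose();
      break;
    }
    const onAbort = (): void => {
      controller.abort(source.reason);
      dispose();
    };
    source.addEventListener("abort", onAbort, { once: true });
    cleanups.push(() => source.removeEventListener("abort", onAbort));
  }

  return { signal: controller.signal, dispose };
}
```

A per-turn caller would merge at the start of the turn and call `dispose()` when the turn ends, whether or not it was aborted.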

…es agent-eval) createTrajectoryRecorder (supervise/trajectory-recorder.ts) — the post-hoc half of the analyst pipe. Replays a worker's captured tool steps as agent-eval spans (InMemoryTraceStore) and runs its PUBLISHED batch analyzers — buildTrajectory (structured run summary), stuckLoopView (full-run repeated-call view, complementing the online consecutive detector), toolWasteView. No analysis reimplemented; the thin bridge from live tool steps to the substrate trace model. Feeds from the same onToolStep seam as the online monitor. 3 recorder tests (real spans → real agent-eval findings); full suite 1017 pass; typecheck/build/lint clean. Closes both legs: online (mid-run) + settle (post-hoc).

…, comment accuracy)
- mergeAbortSignals listener leak: pre-link external signals ONCE; per-turn add+remove the listener (no accumulation on long-lived signals over maxTurns).
- interrupt catch now requires a real AbortError (DOMException); a network fault coincident with an interrupt is no longer swallowed, it is rethrown.
- corrected the comment: an interrupted+re-planned turn DOES consume a maxTurns slot (bounded backstop, not a hang); it just doesn't bill a turn.
- onToolStep is an observability side-channel: wrapped so a throwing monitor can't crash the worker loop; detector-monitor.observeToolStep also defends argHash on circular/unhashable args.
- projectEvent preserves questionId on the answer branch.
- stale await_next purged from skills/{supervise,loop-writer}; trimmed CLAUDE.md redundancy; softened the recorder's per-span-duration claim.
full suite 1018 pass; typecheck/build/lint clean.
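The stricter interrupt catch can be illustrated as below. `isAbortError` and `runTurn` are hypothetical names, and the sketch assumes the inference call rejects with a DOMException named "AbortError" when its signal aborts (as `fetch` does when aborted without a custom reason).

```typescript
// Only a genuine AbortError counts as an interrupt.
function isAbortError(err: unknown): boolean {
  return err instanceof DOMException && err.name === "AbortError";
}

// Hypothetical turn wrapper: an interrupt yields a re-plan marker; any other
// failure (for example a network fault that coincides with an interrupt) is
// rethrown instead of being swallowed.
async function runTurn<T>(
  infer: (signal: AbortSignal) => Promise<T>,
  signal: AbortSignal,
): Promise<T | "interrupted"> {
  try {
    return await infer(signal);
  } catch (err) {
    if (signal.aborted && isAbortError(err)) return "interrupted";
    throw err;
  }
}
```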

… replace router-only seam
The detector/analyzer were built router-only (onToolStep/ToolStep), which was premature; production is sandbox/fleet. Corrected to one interface over agent-eval's ToolSpan:
- TraceSource (trace-source.ts): a worker's tool calls as ToolSpans, from an OWNED loop (createPushTraceSource, router/cli-bridge dispatch) OR a SANDBOX box (sandboxSessionTraceSource(box, sessionId) → box.messages() session parts → decodeToolPart, defensive across OpenAI + harness shapes). The SDK exposes tool calls via the session (SessionMessage.parts / streamPrompt), NOT exportTrace (sandbox telemetry); corrected.
- watchTrace (online) + analyzeTrace (settle) now consume a TraceSource, not a router seam.
- DELETED the router-only createDetectorMonitor/ToolStep/createTrajectoryRecorder/RecordedToolStep.
Common currency = ToolSpan; same agent-eval detectors + batch analyzers over any substrate.
trace-source 11 + watchTrace 5 + analyzeTrace 2 tests incl. the sandbox box path (mock box → session parts → loop detected); full suite 1023 pass; typecheck/build/lint clean. Live-box validation of the exact harness part-shape pending (decoder is defensive).

tangletools

✅ Auto-approved PR — `1e7d7ffc`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-17T10:08:32Z}

drewstone · 2026-06-17T10:09:04Z

Re-opening from a correctly-based branch (this one carried the unsquashed #318 commits → false conflict).

drewstone added 10 commits June 16, 2026 14:36

tangletools approved these changes Jun 17, 2026


drewstone closed this Jun 17, 2026

drewstone wants to merge 10 commits into main from feat/trace-source-sandbox
